Clustering

Published

July 11, 2024

Goal

In this lab we will learn the basics of clustering. The methods we will cover include hierarchical clustering, K means and density clustering.

Packages

Install and load packages.

pkgs_needed = c(
  "tidyverse",
  "dbscan", "GGally", "pheatmap",
  "flowCore","flowViz","flowPeaks", "ggcyto"
)
letsinstall = setdiff(pkgs_needed, installed.packages()) 
if (length(letsinstall) > 0) {
  for (pkg in letsinstall)
    BiocManager::install(pkg)
}

Hierarchical clustering

The Morder data are gene expression measurements for 156 genes on T cells of 3 types (naïve, effector, memory) from 10 patients (Holmes et al. 2005).

Here we load the Morder data.frame from the online directory.

load(url("http://web.stanford.edu/class/bios221/data/Morder.RData"))

# if download fails with timeout error, try increasing it
# options(timeout = 1000)

If you downloaded the file before, you can load it from the local directory.

load("Morder.RData")
Question

Inspect the data. How many samples of each T cell type (naïve, effector, memory) are there in the data?

rownames(Morder) |> 
  strsplit(split = "_") |> 
  sapply(function(x) x[2]) |> 
  table()

EFF MEM NAI 
 10   9  11 

In base R the function to perform hierarchical clustering is hclust. To cluster the genes with hierarchical clustering you first need to compute a distance matrix storing all pairwise (gene-to-gene) dissimilarities. This is how to do it:

# distance calculation
D = dist(t(Morder))
# clustering
gene_clust = hclust(d = D)
# plot dendrogram
plot(gene_clust)

Question

Why in the provided code the input to dist function is t(Morder)?

Because dist computes distances between rows, we need to use t to transposes the expression matrix Morder, so that we cluster genes, and not samples.

In hierarchical clustering one needs to choose the method for agglomerating the clusters. By default hclust uses a “complete” linkage method (see ?hclust for info).

Question

Redo hierarchical clustering with the ward.D2 method and plot the dendrogram.

gene_clust_wd2 = hclust(d = dist(t(Morder)), method = "ward.D2")
plot(gene_clust_wd2)

Notice that the values on the y axis of hclust dendrogram changed. What do they correspond to?

Distance between clusters, according to agglomeration method choice.

Note that in the hclust there are the ward.D and ward.D2 methods available. Please call ?hclust to read about the difference between the two methods.

Next, instead of clustering genes, we will apply hierarchical clustering for samples (observations)

Question

Use dist and hclust (with default linkage method) to cluster samples.

# we don't transpose the matrix now (samples are rows)
D_samples = dist(Morder)
sample_clust = hclust(d = D_samples)

How many clusters of samples are there at the dendrogram height of 12? Hint: the abline function might be helpful.

plot(sample_clust)
abline(a=12, b=0, col="blue")

We see 3 clusters at the height of 12

Now that you know how to perform hierarchical clustering, use pheatmap to generate a heatmap with clustered rows and columns. Below, we do some extra work compared to the default parameters to make sure that 0 is in the center of the color scale, and use beautiful colors.

library("pheatmap")
mycolors = colorRampPalette(c("#FFD800", "#0056B9"))(100)
pheatmap(
  mat = Morder,
  fontsize_col = 5, fontsize_row = 10, 
  color = mycolors, 
  breaks = max(abs(Morder)) * seq(-1, +1, length.out = length(mycolors) + 1)
) 

Question

What type of distance and which clustering method does pheatmap use by default for clustering rows and columns?

Euclidean distance and complete linkage clustering (see ?pheatmap)

Note that these are default values for dist and hclust, too. Look at how clustering heatmap changes if you use different distance and clustering methods (e.g. ‘ward.D2’).

# cluster genes
gene_clust_wd2 = hclust(d = dist(t(Morder)), method = "ward.D2")

# cluster samples
sample_clust_wd2 = hclust(d = dist(Morder), method = "ward.D2")

# order expression matrix by clustering results
Morder_clustered = Morder[sample_clust_wd2$order, gene_clust_wd2$order]

# we use pre-computed clustering, so we need to specify that pheatmap should not
# re-cluster the input data
pheatmap(
  mat = Morder_clustered,
  cluster_rows = FALSE, cluster_cols = FALSE,
  fontsize_col = 5, fontsize_row = 10,
  color = mycolors, 
  breaks = max(abs(Morder)) * seq(-1, +1, length.out = length(mycolors) + 1)
) 

Mass cytometry (CyTOF)

Cytometry is a biophysical technology that allows you to measure physical and chemical characteristics of cells. Modern flow and mass cytometry allows for simultaneous multiparametric analysis of thousands of particles per second.

Flow cytometry enables the simultaneous measurement of 15, whereas mass cytometry (CyTOF) of as many as 40 proteins per single cell.

We will be working with the CyTOF dataset from a single-cell mass cytometry study by Bendall et al. on differential immune drug responses across human hematopoietic cells over time.

First we load the necessary packages.

library("tidyverse")
library("flowCore")
library("flowViz")
library("flowPeaks")
library("ggcyto")

Here we download the data.

# download file from web (14.3 Mb)
download.file(
  url = "http://web.stanford.edu/class/bios221/data/Bendall_2011.fcs",
  destfile = "Bendall_2011.fcs",
  mode = "wb"
)
# If download fails with timeout error, try increasing it
# options(timeout = 1000)
# On Linux and MacOS, you can also try to use the 'wget' command in the Unix terminal instead

Load this data in R using flowCore package.

# load file in R - here we use a function from flowCore package
fcsB = flowCore::read.FCS("Bendall_2011.fcs")
Question

Look at the structure of the fcsB object. How many variables and how many cells were measured?

fcsB
flowFrame object '1.fcs'
with 91392 cells and 41 observables:
                    name                desc     range  minRange  maxRange
$P1                 Time                Time      4262    0.0000      4261
$P2          Cell_length         Cell Length       962    0.0000       961
$P3     Ir(190.960)-Dual             191-DNA      1902    0.0000      1901
$P4     Ir(192.962)-Dual             193-DNA       784  -53.4787       783
$P5     Rh(102.905)-Dual       103-Viability      2495  -29.7542      2494
...                  ...                 ...       ...       ...       ...
$P37    Yb(173.938)-Dual           174-HLADR      2272  -99.9504      2271
$P38    Yb(175.942)-Dual            176-CD90      2640  -99.1393      2639
$P39    Lu(174.940)-Dual           175-CXCR4      3699  -99.9671      3698
$P40 Cd(110,111,112,114)         110_114-CD3      5264 -111.0000      5263
$P41 absoluteEventNumber absoluteEventNumber      2418    0.0000      2417
306 keywords are stored in the 'description' slot
# variables
length(colnames(fcsB))
[1] 41
# cells
nrow(exprs(fcsB))
[1] 91392

Inspect slots of the flowFrame object:

slotNames(fcsB)
[1] "exprs"       "parameters"  "description"

Different slots of the flowFrame objects can be accessed with exprs(fcsB), parameters(fcsB) and description(fcsB) functions.

Data preprocessing

First we get the data table that reports the mapping between isotopes and markers (antibodies).

Here we download the table.

# download data from the web
markersB = read_csv(url("http://web.stanford.edu/class/bios221/data/Bendall_2011_markers.csv"))

If you already downloaded the data you can load it locally.

# load data
markersB = read_csv("Bendall_2011_markers.csv")

Now we replace the isotope names in the column names of fcsB with the marker names. This is simply to make the subsequent analysis and plotting code more readable.

# match isotope names to marker names
mt = match(markersB$isotope, colnames(fcsB))

# check that all isotope names were matched
stopifnot(!any(is.na(mt)))

# replace isotope names with marker names
colnames(fcsB)[mt] = markersB$marker

Below, we show how to plot the joint distribution of the cell lengths and the DNA191 (this variable indicates the activity of the cell, i.e. whether the cell is dead or alive):

flowPlot(fcsB, plotParameters = c("Cell_length", "DNA191"), logy = TRUE)

It is standard to transform both flow and mass cytometry data using one of several special functions, we take the example of the inverse hyperbolic sine (arcsinh), which serves as a variance stabilizing transformation.

First, we show the distribution on untransformed raw data:

densityplot(~`CD3all`, fcsB)

To apply the transformation and to plot the data you can use functions from the flowCore package. After the transformation the cells seem to form two clusters. Based solely on one dimension (CD3all) we see two cell subsets (the two modes).

asinhT = arcsinhTransform(a = 0.1, b = 1)
cols_to_transform = setdiff(colnames(fcsB), c("Time", "Cell_length", "absoluteEventNumber"))
trans1 = transformList(cols_to_transform, asinhT)
fcsBT = transform(fcsB, trans1)
densityplot(~`CD3all`, fcsBT)

\(k\)-means

Let’s cluster cells into two groups using one-dimensional \(k\)-means filter. To learn more about the arguments of the functions type ?kmeansFilter and ?flowCore::filter

kf = kmeansFilter("CD3all"=c("Pop1","Pop2"), filterId="myKmFilter")
fres = flowCore::filter(fcsBT, kf)
summary(fres)
Pop1: 33429 of 91392 events (36.58%)
Pop2: 57963 of 91392 events (63.42%)

Split two data clusters into two separate flowFrame objects:

fcsBT1 = split(fcsBT, fres, population="Pop1")
fcsBT2 = split(fcsBT, fres, population="Pop2")

We can also cluster cells using k-means with the flowPeaks function from flowPeaks package. The algorithm for this clustering algorithm is specified in detail in a paper by Ge Y. et al (2012).

dat = data.frame(exprs(fcsBT)[ , c("CD56", "CD3all")])
fp = flowPeaks(dat)

Starting the flow Peaks analysis...

    Task A: compute kmeans...
        step 0, set the intial seeds, tot.wss=17597.8
        step 1, do the rough EM, tot.wss=11900.7 at 0.096298 sec
        step 2, do the fine transfer of Hartigan-Wong Algorithm
                 tot.wss=11846.3 at 0.221561 sec
        ...finished summarization at 0.327 sec

    Task B: find peaks...
finished at 0.348 sec
plot(fp)

Question

How many dimensions (markers) does the above code use to split the data into 4 cell subsets using kmeans?

2 dimensions, based on CD56 and CD3all

Note that here fcsB or fcsBT are not ‘data.frames’ but objects of class ‘flowFrame’. This meansthat you cannot use fcsB and fcsBT (without conversion to data.frame) as inputs to ggplot. However, ‘flowFrame’ objects hold marker expression data together with sample information data, so you could access any variables you need for plotting directly from these objects (and this would be lost by converting them to data.frame). This is why for plottimg now we use a Bioconductor package ggcyto, built on top of ggplot2. This package includes functions for generating visualizations specifically for cytometry data, starting from ‘flowFrame’ objects.

We can look at distributions of measured markers, e.g CD4, by plotting the histogram:

library("ggcyto")
# Untransformed data
ggcyto(fcsB,aes(x = CD4)) + geom_histogram(bins = 90) 

# Transformed data
ggcyto(fcsBT, aes(x = CD4)) + geom_histogram(bins = 90) 

autoplot function plots the density of selected markers.

# ggcyto automatic plotting
autoplot(fcsBT, "CD4")

We can compare the distributions of two markers, e.g. CD4 and CD8.

# 2d density plot
ggcyto(fcsBT, aes(x = CD4, y = CD8)) + geom_density2d(colour="black") 

# hexbin plot
ggcyto(fcsBT, aes(x = CD4, y = CD8)) + geom_hex(bins = 50) 

The same can be achieved with autoplot for two variables.

# ggcyto automatic plotting
autoplot(fcsBT, "CD4", "CD8", bins = 50)

For more details on capabilities of ggcyto refer to the following link

Density based clustering

Data sets such as flow cytometry containing only a few variables (markers) and a large number of observations (cells) are amenable to density clustering. DBSCAN algorithm looks for regions of high density separated by sparse emptier regions. This method has the advantage of being able to cope with clusters that are not necessarily convex (i.e. not blob-shaped). One implementation of such a method is called dbscan, let us look at an example by running the code below.

library("dbscan")
# Select a small subset of 5 protein markers
mc5 = exprs(fcsBT)[, c("CD4", "CD8", "CD20", "CD3all", "CD56")]
res5 = dbscan::dbscan(mc5, eps = .65, minPts = 30)
mc5df = data.frame(mc5)
mc5df$cluster = as.factor(res5$cluster)
Question

How many clusters did dbscan find?

table(mc5df$cluster)

    0     1     2     3     4     5     6     7 
77655   230  5114  4616  3310   207   231    29 

7 clusters were found (0 are unclustered observations).

Note that we do not count the 0 as a cluster, since it contains all cells that were not assigned to any cluster. Looking up the documentation of dbscan (?dbscan) we find that “Points which cannot be assigned to a cluster will be reported as members of the noise cluster 0.”

Question

How many cells were clustered into cluster 3 by dbscan?

sum(mc5df$cluster == 3)
[1] 4616

We can now generate a CD8-vs-CD4 2d-density plot for the cells colored by their assigned cluster labels, computed by dbscan:

ggplot(mc5df, aes(x = CD4, y = CD8, colour = cluster)) + geom_density2d()

And do the same for CD3all and CD20 markers:

ggplot(mc5df,aes(x = CD3all, y = CD20, colour = cluster))+ geom_density2d()

Observe that the nature nature of the clustering is multidimensional, as the projections into two dimensions show overlapping clusters.

Validating and choosing the number of clusters

The clustering methods we have described are tailored to deliver the best grouping of the data under various constrains, however they will always deliver groups, even if there are none. This is important, e.g. when performing kmeans clustering, as we have to set the ‘k’ parameter (for the number of clusters to group observations into) ahead of time. What choice of ‘k’ is valid though?

Here we want to illustate the use of the “wss” (within sum of squares) statistic to evaluate the quality of a clustering. Note that as \(k\) (number of cluster for k-means algorithm) increases, wss will also decrease. We simulate data coming from 4 groups. In particular, we generate 2-dimensional observations (as if there were only 2 proteins measured for each cell). The four groups are generated from 2-d multivariate normals with centers at \(\mu_1 = (0, 0)\), \(\mu_2 = (0, 8)\), \(\mu_3 = (8, 0)\), \(\mu_4 = (8, 8)\). In this simulation, we know the ground truth (4 groups), but we will try to cluster the data using the kmeans argorithm with different choices for the k parameter. We will see how the wss statistic varies as we vary k.

We have used the |> operator (if you do not understand the code, try to see what simul4 contains and repeat the same using code that does not use the |> operator).

simul4 = lapply(c(0,8), function(x){
  lapply(c(0,8), function(y){
    data.frame(x = rnorm(100, x, 2),
               y = rnorm(100, y, 2), 
               class = paste(x, y, sep = "")
    )
  }) |> do.call(what = rbind)
}) |> do.call(what = rbind)
ggplot(simul4, aes(x = x, y = y)) +
  geom_point(aes(color = class), size = 2)

# Compute the kmeans within group wss for k=1 to 12
wss = rep(0,8)
# for a single cluster the WSS statistic is just sum of squares of centered data
wss[1] = sum(apply(scale(simul4[,1:2], scale = F), 2, function(x){ x^2 }))
# for k = 2, 3, ... we perform kmeans clustering and compute the associated WSS statistic
for (k in 2:8) {
  km4 = kmeans(simul4[,1:2],k)
    wss[k] =  sum(km4$withinss)
}
# Now, we are ready to plot the computed statistic:
ggplot(data.frame(k = 1:length(wss), wss = wss)) +
  geom_point(aes(x = k, y = wss), color = "blue", size= 3) +
  xlab('k') + ylab('WSS(k)')

Within sum of squares (wss) statistic, we see that the last substantial decrease of the statistic occurres before \(k=4\), and for values \(k = 5, 6, \dots\) the quantity ‘levels-off’. In practice, we would choose \(k=4\), a value happening at the ‘elbow’ of the plot (elbow-rul). Of course this choice is still somewhat subjective. The book chapter describes additional ways of choosing k (e.g. the gap statistic).

Session info

sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Uzhgorod
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] dbscan_1.2-0         ggcyto_1.32.0        flowWorkspace_4.16.0
 [4] ncdfFlow_2.50.0      BH_1.84.0-0          flowPeaks_1.50.0    
 [7] flowViz_1.68.0       lattice_0.22-6       flowCore_2.16.0     
[10] lubridate_1.9.3      forcats_1.0.0        stringr_1.5.1       
[13] dplyr_1.1.4          purrr_1.0.2          readr_2.1.5         
[16] tidyr_1.3.1          tibble_3.2.1         ggplot2_3.5.1       
[19] tidyverse_2.0.0      pheatmap_1.0.12     

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1    farver_2.1.2        fastmap_1.2.0      
 [4] XML_3.99-0.17       digest_0.6.36       timechange_0.3.0   
 [7] lifecycle_1.0.4     magrittr_2.0.3      compiler_4.4.1     
[10] rlang_1.1.4         tools_4.4.1         utf8_1.2.4         
[13] yaml_2.3.9          data.table_1.15.4   knitr_1.48         
[16] labeling_0.4.3      htmlwidgets_1.6.4   interp_1.1-6       
[19] bit_4.0.5           plyr_1.8.9          RColorBrewer_1.1-3 
[22] KernSmooth_2.23-24  withr_3.0.0         RProtoBufLib_2.16.0
[25] BiocGenerics_0.50.0 grid_4.4.1          stats4_4.4.1       
[28] fansi_1.0.6         latticeExtra_0.6-30 colorspace_2.1-0   
[31] scales_1.3.0        MASS_7.3-61         isoband_0.2.7      
[34] cli_3.6.3           rmarkdown_2.27      crayon_1.5.3       
[37] generics_0.1.3      tzdb_0.4.0          IDPmisc_1.1.21     
[40] zlibbioc_1.50.0     parallel_4.4.1      matrixStats_1.3.0  
[43] vctrs_0.6.5         jsonlite_1.8.8      cytolib_2.16.0     
[46] hms_1.1.3           S4Vectors_0.42.1    bit64_4.0.5        
[49] Rgraphviz_2.48.0    jpeg_0.1-10         hexbin_1.28.3      
[52] glue_1.7.0          stringi_1.8.4       gtable_0.3.5       
[55] deldir_2.0-4        munsell_0.5.1       pillar_1.9.0       
[58] htmltools_0.5.8.1   graph_1.82.0        R6_2.5.1           
[61] vroom_1.6.5         evaluate_0.24.0     Biobase_2.64.0     
[64] png_0.1-8           Rcpp_1.0.12         gridExtra_2.3      
[67] xfun_0.45           pkgconfig_2.0.3